The code for this analysis is published in a public GitHub repository.
Friends is an American situation comedy created by David Crane and Marta Kauffman. It aired on NBC for ten seasons, from September 22, 1994, to May 6, 2004, with an ensemble cast starring Jennifer Aniston (Rachel), Courteney Cox (Monica), Lisa Kudrow (Phoebe), Matt LeBlanc (Joey), Matthew Perry (Chandler) and David Schwimmer (Ross).
The show revolved around six friends in their 20s and 30s who lived in Manhattan, New York City. Rachel Green, a sheltered but friendly woman, flees her wedding day and her rich yet unfulfilling life, and finds childhood friend Monica Geller, a tightly-wound but caring chef. After Rachel becomes a waitress at coffee house Central Perk, she and Monica become roommates at Monica’s apartment located directly above Central Perk, and Rachel joins Monica’s group of single people in their mid-20s: her previous roommate Phoebe Buffay, an eccentric, innocent masseuse; her neighbor across the hall Joey Tribbiani, a dim-witted yet loyal struggling actor and womanizer; Joey’s roommate Chandler Bing, a sarcastic, self-deprecating IT manager; and her older brother and Chandler’s college roommate Ross Geller, a sweet-natured but insecure paleontologist.
Friends received positive reviews throughout its run and became one of the most popular sitcoms of its time. The series won many awards and was nominated for 63 Primetime Emmy Awards. The series was also very successful in the ratings, consistently ranking in the top ten in the final primetime ratings. Friends has made a large cultural impact and has become a model for later sitcoms.
As teenagers at the beginning of the century, we were heavily influenced by the Friends phenomenon and became huge fans of the sitcom. We decided to work on this project to challenge our preconceptions of the show through data analysis and to discover hidden insights. The questions that guide our quantitative assessment are the following:
Can we categorize all the characters that appear in the sitcom by importance? At first glance this question could seem simple, but the analysis is a challenge if we assume no previous knowledge of the sitcom and consider that more than 800 characters appeared over the ten seasons.
Can we identify and quantify the interactions between the main and secondary characters? What would be an appropriate way to quantify and visualize these relationships?
Which are the most recurrent topics through the seasons and episodes of the show? How did the themes of the show evolve over its ten seasons? Can we extract this information from the dialogues of the show?
Can we determine the contribution of each character to the popularity of the sitcom? Does the participation of each character influence the viewer’s preferences?
We used the following R libraries for this project:
For data extraction and manipulation: dplyr, rvest, robotstxt, base, Hmisc
For data visualization: ggplot2, ggthemes, ggrepel, visNetwork, d3
Some of the analyses and visualizations also rely on Machine Learning (ML) techniques and other statistical tools: k-means clustering (cluster), graph and network analysis (igraph), and topic analysis (textmineR, stopwords)
The primary data sources that we used for our project, and that we consider to be of adequate quality, are:
Transcripts: For the transcripts, we used an open resource built by fans of the sitcom and compiled in a GitHub repository. The repository contains all the characters' dialogue for the 231 episodes of the TV show, organized in HTML documents. The data can be accessed via: https://fangj.github.io/friends/. If you want to see how the transcripts are originally presented, please click here.
Ratings: For the ratings, we used the IMDb Datasets, which are available to customers for personal and non-commercial use. The data is structured in seven compressed TSV files that contain general information about the show (genre, start year, end year, episode duration, etc.) and specific information about each episode (title, rating, characters, crew, etc.). A relevant characteristic of the database is that it is refreshed daily. We retrieved the data on November 10, 2019. The data can be accessed via: https://datasets.imdbws.com/
IMDb Dataset:
The first obstacle that we faced with the IMDb datasets was their size: some of them have millions of rows covering TV series, shorts, movies, documentaries, and other entertainment formats. It was impossible to store them in our GitHub repository.
The second obstacle was tracking down the data corresponding to our case study. For example, when we searched the dataset only by the name ‘Friends’, we found 178 TV series or movies called ‘Friends’. We had to look up the start and end years of the series to refine the search.
Another obstacle was that the ID for TV series is not named uniformly across the seven IMDb datasets. For example, in the dataset with the titles of the TV series, the ID that identifies the show is named “tconst”, while in the dataset with the episodes, “tconst” is the ID of the episode and the ID of the TV series is called “parentTconst”. We identified these inconsistencies by exploring the datasets.
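The three obstacles above suggest a filter-then-join workflow, which can be sketched in base R as follows. The toy tables, their values, and the `startYear` filter are assumptions made for illustration only; the column names `tconst` and `parentTconst` are the ones described in the text. This is a sketch of the idea, not necessarily the code used in the project.

```r
# Toy stand-ins for three IMDb tables; every value here is made up.
basics <- data.frame(
  tconst       = c("tt1", "tt2"),
  titleType    = c("tvSeries", "movie"),
  primaryTitle = c("Friends", "Friends"),
  startYear    = c(1994, 2002),
  stringsAsFactors = FALSE
)
episodes <- data.frame(
  tconst        = c("tt11", "tt12", "tt21"),
  parentTconst  = c("tt1", "tt1", "tt9"),
  seasonNumber  = c(1, 1, 3),
  episodeNumber = c(1, 2, 5),
  stringsAsFactors = FALSE
)
ratings <- data.frame(
  tconst        = c("tt11", "tt12", "tt21"),
  averageRating = c(8.4, 8.1, 7.0),
  numVotes      = c(6000, 4000, 10),
  stringsAsFactors = FALSE
)

# 1. The title alone is ambiguous, so narrow by type and start year as well.
show <- basics[basics$primaryTitle == "Friends" &
                 basics$titleType == "tvSeries" &
                 basics$startYear == 1994,
               c("tconst", "primaryTitle")]

# 2. In the episode table the series id is called parentTconst, so rename
#    the key before joining.
names(show)[names(show) == "tconst"] <- "parentTconst"
show_episodes <- merge(show, episodes, by = "parentTconst")

# 3. The episode id is called tconst in both remaining tables.
show_ratings <- merge(show_episodes, ratings, by = "tconst")
show_ratings
```

Filtering before joining also keeps the intermediate tables small, which matters when the raw files have millions of rows.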
Transcript Dataset:
The main obstacle with the dialogue dataset is that not all the HTML files share the same format. We overcame this difficulty by adding branches to our scraping code for each special case that we detected.
The second difficulty with the dialogue dataset was cleaning the dialogues themselves. We tried to standardize the content of the dialogues as much as possible by identifying different names for the same character, common typos, and recurring text patterns that could hinder our analysis.
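As a rough illustration of this kind of standardization, the sketch below normalizes speaker names in base R. The `clean_names` helper, the sample names, and the alias table are all hypothetical; the actual rules live in the project's scraping code and are likely more extensive.

```r
# Hypothetical raw speaker labels as they might come out of the scraper.
raw_names <- c("Rachel:", "RACH", " monica ", "Chandler", "CHAN", "Phoebe:")

# Hypothetical alias table mapping variants to one canonical name.
aliases <- c(RACH = "RACHEL", CHAN = "CHANDLER")

clean_names <- function(x, aliases) {
  x <- toupper(trimws(x))   # uniform case, no surrounding whitespace
  x <- sub(":$", "", x)     # drop a trailing colon left by the transcript
  hit <- x %in% names(aliases)
  x[hit] <- unname(aliases[x[hit]])  # replace known variants
  x
}

cleaned <- clean_names(raw_names, aliases)
cleaned
```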
You can follow the scraping code that led to the following data frame by looking into the “friends.Rmd” file in the GitHub repository.
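library(robotstxt)  # provides paths_allowed()
url <- "https://fangj.github.io/friends/"
# Check that the site's robots.txt permits scraping this path
paths_allowed(url)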
## # A tibble: 6 x 5
## episode_id line_num scene character line
## <chr> <dbl> <dbl> <chr> <chr>
## 1 1 : 01 1 1 MONICA There's nothing to tell! He's just s…
## 2 1 : 01 2 1 JOEY C'mon, you're going out with the guy…
## 3 1 : 01 3 1 CHANDLER All right Joey, be nice. So does he …
## 4 1 : 01 4 1 PHOEBE Wait, does he eat chalk?
## 5 1 : 01 5 1 PHOEBE Just, 'cause, I don't want her to go…
## 6 1 : 01 6 1 MONICA Okay, everybody relax. This is not e…
## # A tibble: 6 x 2
## line words
## <chr> <int>
## 1 There's nothing to tell! He's just some guy I work with! 11
## 2 C'mon, you're going out with the guy! There's gotta be something w… 14
## 3 All right Joey, be nice. So does he have a hump? A hump and a hair… 16
## 4 Wait, does he eat chalk? 5
## 5 Just, 'cause, I don't want her to go through what I went through w… 16
## 6 Okay, everybody relax. This is not even a date. It's just two peop… 21
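The words column is consistent with counting whitespace-separated tokens. A minimal base R sketch of such a count is shown here; the `count_words` helper is hypothetical and the project may compute the column differently.

```r
# Count whitespace-separated tokens in each line of dialogue.
count_words <- function(line) {
  lengths(strsplit(trimws(line), "\\s+"))
}

lines <- c(
  "There's nothing to tell! He's just some guy I work with!",
  "Wait, does he eat chalk?"
)
word_counts <- count_words(lines)
word_counts
```

For the two sample lines this gives 11 and 5, matching the values shown in the table.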
We can see that some episodes were put together in the same file:
## [1] "2 : 12-13" "6 : 15-16" "9 : 23-24" "10 : 17-18"
Now we split those episodes into two different ones:
For more clarity, we will add season and episode columns.
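One way to handle the identifiers in these two steps is sketched below in base R. The `expand_episode_id` helper is hypothetical, and the sketch only deals with the ids; how the dialogue lines of a double episode are divided between its two halves is not covered here.

```r
# Expand a combined id such as "2 : 12-13" into one id per episode.
expand_episode_id <- function(id) {
  parts  <- strsplit(id, " : ", fixed = TRUE)[[1]]
  season <- parts[1]
  eps    <- strsplit(parts[2], "-", fixed = TRUE)[[1]]
  paste(season, ":", eps)
}

combined <- c("2 : 12-13", "10 : 17-18")
single_ids <- unlist(lapply(combined, expand_episode_id))

# Derive integer season and episode columns from the id.
season  <- as.integer(sub(" : .*$", "", single_ids))
episode <- as.integer(sub("^.* : ", "", single_ids))

data.frame(episode_id = single_ids, season = season, episode = episode)
```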
Now we correct some character names that had typos and remove some lines that the scraping code catches but that are not dialogue.
## # A tibble: 6 x 8
## episode_id line_num scene character line words season episode
## <chr> <dbl> <dbl> <chr> <chr> <int> <int> <int>
## 1 1 : 01 1 1 MONICA There's nothing… 11 1 1
## 2 1 : 01 2 1 JOEY C'mon, you're g… 14 1 1
## 3 1 : 01 3 1 CHANDLER All right Joey,… 16 1 1
## 4 1 : 01 4 1 PHOEBE Wait, does he e… 5 1 1
## 5 1 : 01 5 1 PHOEBE Just, 'cause, I… 16 1 1
## 6 1 : 01 6 1 MONICA Okay, everybody… 21 1 1
Further data transformations are introduced and explained in each of the Results subsections.
+ Step 1.2: Extract, decompress and save as data frames. The three IMDb tables that we used in our analysis are those that allowed us to extract the information related to the rating of each episode.
title.ratings.tsv.gz
## 'data.frame': 990485 obs. of 3 variables:
## $ tconst : Factor w/ 990485 levels "tt0000001","tt0000002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ averageRating: num 5.6 6.1 6.5 6.2 6.1 5.2 5.5 5.4 5.4 6.9 ...
## $ numVotes : int 1547 187 1204 114 1932 102 615 1663 81 5539 ...
title.episode.tsv.gz
## 'data.frame': 4425501 obs. of 4 variables:
## $ tconst : Factor w/ 4425501 levels "tt0041951","tt0042816",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ parentTconst : Factor w/ 128694 levels "tt0038276","tt0039122",..: 59 34810 34810 22 34810 34810 34810 34274 34810 34810 ...
## $ seasonNumber : Factor w/ 244 levels "\\N","1","10",..: 2 2 1 131 103 103 131 2 131 179 ...
## $ episodeNumber: Factor w/ 15556 levels "\\N","0","1",..: 14445 6320 1 9103 6209 13326 7769 11103 9547 1115 ...
## 'data.frame': 236 obs. of 9 variables:
## $ parentTconst : chr "tt0108778" "tt0108778" "tt0108778" "tt0108778" ...
## $ titleType : Factor w/ 1 level "tvSeries": 1 1 1 1 1 1 1 1 1 1 ...
## $ primaryTitle : Factor w/ 1 level "Friends": 1 1 1 1 1 1 1 1 1 1 ...
## $ tconst : chr "tt0583431" "tt0583432" "tt0583433" "tt0583434" ...
## $ seasonNumber : Factor w/ 244 levels "\\N","1","10",..: 212 3 3 3 223 3 190 201 103 190 ...
## $ episodeNumber: Factor w/ 15556 levels "\\N","0","1",..: 13326 14445 6320 6431 3 3 3 3 2226 7769 ...
## $ averageRating: num 8.2 8.6 9.5 9.7 8.7 8.5 8.9 8.7 8.6 8.8 ...
## $ numVotes : int 2568 2641 5829 9699 2783 2889 3376 2962 3472 3100 ...
## $ episode_id : chr "7 : 08" "10 : 09" "10 : 17" "10 : 18" ...
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 61264 obs. of 12 variables:
## $ episode_id : chr "1 : 01" "1 : 01" "1 : 01" "1 : 01" ...
## $ line_num : num 1 2 3 4 5 6 7 8 9 10 ...
## $ scene : num 1 1 1 1 1 1 1 2 2 2 ...
## $ character : chr "MONICA" "JOEY" "CHANDLER" "PHOEBE" ...
## $ line : chr "There's nothing to tell! He's just some guy I work with!" "C'mon, you're going out with the guy! There's gotta be something wrong with him!" "All right Joey, be nice. So does he have a hump? A hump and a hairpiece?" "Wait, does he eat chalk?" ...
## $ words : int 11 14 16 5 16 21 6 22 5 11 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ episode : int 1 1 1 1 1 1 1 1 1 1 ...
## $ parentTconst : chr "tt0108778" "tt0108778" "tt0108778" "tt0108778" ...
## $ tconst : chr "tt0583459" "tt0583459" "tt0583459" "tt0583459" ...
## $ averageRating: num 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 ...
## $ numVotes : int 6098 6098 6098 6098 6098 6098 6098 6098 6098 6098 ...
To search for missing values, we look at the number of missing values per column in dialogues.
## episode_id line_num scene character line
## 0 0 0 0 61
## words season episode parentTconst tconst
## 0 0 0 0 0
## averageRating numVotes
## 0 0
There appear to be some missing values in the line column. We will use visna from the extracat library to see the pattern of missing values.
These missing values are due to different formats in the GitHub page used for web scraping. For example, lines like “Paolo: (something in Italian)” produce an NA line because the scraping code removes everything between parentheses.
We decided to fill those missing values with “”. By doing so we keep a record of those characters' dialogue.
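The count-then-fill step can be sketched in base R as follows. The three-row data frame is a made-up stand-in for the dialogues table, used only to show the mechanics.

```r
# Made-up stand-in for the dialogues table.
dialogues <- data.frame(
  character = c("PAOLO", "RACHEL", "ROSS"),
  line      = c(NA, "Hi!", "Hey."),
  stringsAsFactors = FALSE
)

# Number of missing values per column.
na_per_column <- colSums(is.na(dialogues))
na_per_column

# Replace missing lines with an empty string so the row, and therefore the
# record that the character spoke, is kept.
dialogues$line[is.na(dialogues$line)] <- ""
```

Keeping the rows matters later on, because line and scene counts per character depend on every spoken turn being present.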
To answer this question we used k-means, an unsupervised Machine Learning technique. Its objective is to label the data based on certain characteristics; in this case, we used the number of words, lines, and scenes of each character. To accomplish this task we used the libraries cluster and base. Moreover, for practical reasons we decided to fix the number of groups (the k in k-means) a priori; the centers reported below correspond to three groups.
From the k-means analysis we obtain the following separation of characters:
+ Main Characters: As expected, Rachel, Monica, Phoebe, Joey, Chandler and Ross constitute one group, which on average has 1,680 scenes, 8,469 lines and 87,498 words.
+ Secondary Characters: This group is composed of 33 characters, most of them recurring characters and guest stars. The average character in this group has 35 scenes, 133 lines and 1,228 words.
Centers:
## Total_scene Total_lines Total_words vcluster
## 1 1679.833333 8469.33333 87783.33333 3
## 2 36.343750 135.81250 1251.81250 1
## 3 2.487365 7.34296 65.11432 2
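A minimal sketch of this kind of clustering, using `kmeans` from base R's stats package on invented per-character totals, is shown here. The character labels and all numbers are made up, and choices such as `nstart` and working in raw (unscaled) units are assumptions rather than the project's actual settings.

```r
set.seed(1)  # k-means starts from random centers

# Invented totals for ten hypothetical characters in three obvious tiers.
characters <- data.frame(
  name   = c("main_1", "main_2", "main_3",
             "secondary_1", "secondary_2", "secondary_3",
             "minor_1", "minor_2", "minor_3", "minor_4"),
  scenes = c(1700, 1650, 1690, 40, 30, 35, 2, 3, 1, 2),
  lines  = c(8500, 8400, 8450, 140, 120, 130, 7, 8, 5, 6),
  words  = c(87600, 87400, 87500, 1300, 1200, 1250, 60, 70, 50, 65)
)

# Three groups, fixed in advance; several random starts for stability.
fit <- kmeans(characters[, c("scenes", "lines", "words")],
              centers = 3, nstart = 50)

characters$cluster <- fit$cluster
fit$centers  # one row of average scenes, lines and words per group
```

Because the variables are left in raw units, the word count dominates the distances; scaling the columns first would be a reasonable alternative.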
Friends is a TV show that tells the story of a group of six friends: Monica, Rachel, Phoebe, Chandler, Ross and Joey. Is one of these characters more important than others? We try to answer this question by looking at the number of lines for each of these main characters.
We can see that Rachel is the character with the most lines and Phoebe the one with the fewest. Now we focus on the number of words instead of the number of lines.
Rachel and Ross are again the characters that speak the most, and Phoebe is the one with the fewest words. Monica ranks third by number of lines but fifth by number of words, which suggests that her lines tend to be shorter. The opposite happens with Joey: he ranks fifth by number of lines but third by number of words, which suggests that his lines tend to be longer.
By looking at the distribution of lines per episode we find the following:
* Monica’s distribution looks narrower than the others. This indicates that there are few episodes in which Monica speaks a lot.
* Chandler and Ross have long right tails; we infer that those characters have episodes in which they speak a lot.
* Rachel and Ross have wider distributions.
For the network analysis, a special data structure is required. We define an interaction between characters as sharing the same scene. The original structure of the dialogues does not let us identify exactly who interacts with whom within a scene, so we assumed that all the characters who appear in a scene interact with each other. We represent the interactions with an adjacency matrix that records the number of interactions each character has with every other character.
With the igraph library we were able to create the adjacency matrix of the characters and quantify the interactions among the 869 characters.
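The scene-sharing definition can be turned into an adjacency matrix with a few lines of base R, sketched below on a made-up set of scenes. This illustrates the counting logic only; it is not the igraph-based code used in the project.

```r
# Made-up dialogue rows: which character speaks in which scene.
dialogues <- data.frame(
  scene     = c(1, 1, 1, 2, 2, 3, 3, 3),
  character = c("MONICA", "JOEY", "MONICA", "ROSS", "RACHEL",
                "MONICA", "ROSS", "JOEY"),
  stringsAsFactors = FALSE
)

# One row per (scene, character) pair, however many lines were spoken.
presence <- unique(dialogues[, c("scene", "character")])

# Scenes x characters incidence matrix of 0/1 entries.
incidence <- unclass(table(presence$scene, presence$character))

# Characters x characters matrix: entry (i, j) counts the scenes shared by
# characters i and j.
adjacency <- crossprod(incidence)
diag(adjacency) <- 0  # a character does not interact with itself
adjacency
```

A matrix like this can then be handed to igraph, for example through `graph_from_adjacency_matrix()` with `weighted = TRUE`, to build the network.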
Interactivity Networks
For topic modelling we will use the package textmineR. We will try to find the topic for each episode. To do so we will create a document for each episode, so we have to group lines by episode_id.
## # A tibble: 6 x 2
## episode_id lines
## <chr> <chr>
## 1 1 : 01 There's nothing to tell! He's just some guy I work with! C'mo…
## 2 1 : 02 What you guys don't understand is, for us, kissing is as impo…
## 3 1 : 03 Hi guys! Hey, Pheebs! Hi! Hey. Oh, oh, how'd it go? Um, not s…
## 4 1 : 04 "Alright. Phoebe? Okay, okay. If I were omnipotent for a day,…
## 5 1 : 05 "Would you let it go? It's not that big a deal. Not that big …
## 6 1 : 06 Ooh! Look! Look! Look! Look, there's Joey's picture! This is …
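Building one document per episode amounts to concatenating its lines in order. A base R sketch on a made-up three-line table follows; the project itself may do this with dplyr instead.

```r
# Made-up stand-in for the dialogues table.
dialogues <- data.frame(
  episode_id = c("1 : 01", "1 : 01", "1 : 02"),
  line       = c("Hi!", "Hey.", "Bye."),
  stringsAsFactors = FALSE
)

# Paste all lines of an episode into a single string, in original order.
episode_docs <- aggregate(line ~ episode_id, data = dialogues,
                          FUN = paste, collapse = " ")
names(episode_docs)[names(episode_docs) == "line"] <- "lines"
episode_docs
```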
The function CreateDtm creates a document-term matrix. We pass it a set of stopwords: words we do not want to use because they are very frequent in English and give little insight.
We use the document-term matrix to build a term frequency table that counts the number of times each term appears (term frequency) and the number of documents in which it appears (document frequency).
These are the main terms ordered by term frequency:
## term term_freq doc_freq
## 11509 good 1714 231
## 11508 god 1677 228
## 11507 guys 1468 225
## 11506 great 1342 225
## 11505 time 1215 229
## 11504 back 1125 223
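To make the two counts concrete, the sketch below builds a tiny document-term matrix by hand in base R and derives term frequency and document frequency from it. The three toy documents are invented, and the code illustrates the definitions rather than textmineR's internals.

```r
# Three invented documents, already lower-cased and stripped of stopwords.
docs <- c(e1 = "good god good guys", e2 = "good time", e3 = "guys time time")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Documents x terms matrix of counts.
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

term_freq <- colSums(dtm)      # total occurrences of each term
doc_freq  <- colSums(dtm > 0)  # number of documents containing each term

freq_table <- data.frame(term = vocab, term_freq = term_freq, doc_freq = doc_freq)
freq_table[order(-freq_table$term_freq), ]
```

A term whose document frequency is close to the number of documents, like “good” in the real data, appears almost everywhere and therefore says little about any single episode.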
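Now we fit a Latent Dirichlet Allocation (LDA) model with 15 topics to the collection of episodes. This returns two main matrices: theta, which gives the distribution of topics within each episode, and phi, which gives the distribution of terms within each topic: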
## [1] "Theta:"
## t_1 t_2 t_3 t_4 t_5
## 1 : 01 0.003959440 0.2482858522 0.0348623853 0.069628199 0.08025109
## 1 : 02 0.017833456 0.0281503316 0.0325718497 0.237435520 0.03109801
## 1 : 03 0.011225296 0.0001581028 0.1693280632 0.003320158 0.02387352
## 1 : 04 0.004108681 0.0014579192 0.0557985421 0.139297548 0.09158383
## 1 : 05 0.007375271 0.0001446132 0.0001446132 0.111496746 0.11005061
## 1 : 06 0.028275352 0.0031088083 0.0549222798 0.001628423 0.39393042
## [1] "Phi:"
## met_guy gellar meeting_meeting cameras_smell potpourri
## t_1 8.583028e-06 8.583028e-06 8.583028e-06 8.583028e-06 8.583028e-06
## t_2 8.660333e-06 8.660333e-06 8.660333e-06 1.818670e-04 8.660333e-06
## t_3 5.786066e-06 5.786066e-06 5.786066e-06 5.786066e-06 5.786066e-06
## t_4 9.490457e-06 9.490457e-06 9.490457e-06 9.490457e-06 9.490457e-06
## t_5 6.904695e-06 6.904695e-06 6.904695e-06 6.904695e-06 6.904695e-06
## t_6 9.590578e-06 9.590578e-06 9.590578e-06 9.590578e-06 9.590578e-06
Now the 15 topics have been created. To assess their quality we look at topic coherence, a measure of how strongly the words in a topic are associated with each other.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.002827 0.003716 0.019262 0.035778 0.056138 0.115704
We will use phi to get the top 5 terms per topic.
## [,1] [,2] [,3] [,4] [,5]
## t_1 "sister" "meet" "give" "part" "honey"
## t_2 "wedding" "married" "guys" "love" "parents"
## t_3 "guys" "game" "money" "apartment" "move"
## t_4 "janice" "carol" "guy" "woman" "susan"
## t_5 "guys" "big" "bye" "julie" "listen"
## t_6 "emma" "mike" "guys" "baby" "love"
## t_7 "monkey" "marcel" "people" "joke" "drake"
## t_8 "dad" "birthday" "mom" "guys" "party"
## t_9 "baby" "ring" "pregnant" "guys" "god"
## t_10 "job" "guy" "guys" "great" "good"
## t_11 "cat" "thing" "mark" "god" "love"
## t_12 "guys" "love" "good" "plane" "bob"
## t_13 "guys" "thanksgiving" "year" "dog" "school"
## t_14 "good" "god" "time" "great" "wait"
## t_15 "emily" "married" "love" "london" "pheebs"
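Extracting top terms from phi means sorting each topic's row and keeping the first few column names. The base R sketch below does this on an invented 2 x 3 phi matrix; textmineR also provides `GetTopTerms()` for the same purpose, and the project may well use that instead.

```r
# Invented phi: one row per topic, one column per term, rows sum to 1.
phi <- matrix(c(0.5, 0.3, 0.2,
                0.1, 0.2, 0.7),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("t_1", "t_2"), c("wedding", "job", "baby")))

n_top <- 2  # number of terms to keep per topic

# For each topic, order the terms by probability and keep the first n_top.
top_terms <- t(apply(phi, 1, function(p) {
  names(sort(p, decreasing = TRUE))[seq_len(n_top)]
}))
top_terms
```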
The next step is to compute the topic prevalence using theta. Topic prevalence indicates which topics are most frequent in the TV show.
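A common way to compute prevalence, assumed here, is to sum each topic's column of theta and express it as a percentage of the total. The sketch uses an invented theta with two episodes and three topics.

```r
# Invented theta: one row per episode, one column per topic, rows sum to 1.
theta <- matrix(c(0.7, 0.2, 0.1,
                  0.5, 0.3, 0.2),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("1 : 01", "1 : 02"), c("t_1", "t_2", "t_3")))

# Share of the whole corpus attributed to each topic, in percent.
prevalence <- colSums(theta) / sum(theta) * 100
sort(prevalence, decreasing = TRUE)
```

With this definition the prevalences of all topics add up to 100.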
Finally, we get a summary for the complete LDA model.
## topic coherence prevalence top_terms
## t_14 t_14 0.000 40.628 good, god, time, great, wait
## t_10 t_10 0.003 6.197 job, guy, guys, great, good
## t_3 t_3 0.009 5.902 guys, game, money, apartment, move
## t_5 t_5 -0.003 5.432 guys, big, bye, julie, listen
## t_8 t_8 0.064 4.690 dad, birthday, mom, guys, party
## t_9 t_9 0.019 4.507 baby, ring, pregnant, guys, god
## t_1 t_1 0.004 4.174 sister, meet, give, part, honey
## t_4 t_4 0.061 4.097 janice, carol, guy, woman, susan
## t_2 t_2 0.050 3.800 wedding, married, guys, love, parents
## t_15 t_15 0.116 3.640 emily, married, love, london, pheebs
## t_7 t_7 0.052 3.601 monkey, marcel, people, joke, drake
## t_11 t_11 0.006 3.494 cat, thing, mark, god, love
## t_13 t_13 0.051 3.460 guys, thanksgiving, year, dog, school
## t_6 t_6 0.108 3.349 emma, mike, guys, baby, love
## t_12 t_12 -0.002 3.028 guys, love, good, plane, bob
We can see that the most prevalent (most frequent) topic has words like “good”, “god”, “great” and “time”. This makes sense: these words are very frequent throughout the TV show, so they give very little information about any specific topic, which is why the coherence of this topic is 0.0.
The other topics in the model have less prevalence but they are more coherent. If you are a fan of the show and if you read the list of top terms, we are sure you can remember episodes in which those terms were important.
To find those important episodes we created a d3 tool. We wrote a csv file using theta in which, for each episode and topic we put the probability of that topic given the episode and the top terms of that topic.
## id topic value topic_num
## 1 1 : 01 t_1 0.0039594399 1
## 2 1 : 01 t_14 0.3690004829 14
## 3 1 : 01 t_6 0.0000965717 6
## 4 1 : 01 t_15 0.0020280058 15
## 5 1 : 01 t_13 0.0242394978 13
## 6 1 : 01 t_7 0.0551424433 7
## top_terms name
## 1 sister, meet, give, part, honey Monica Gets A Roommate
## 2 good, god, time, great, wait Monica Gets A Roommate
## 3 emma, mike, guys, baby, love Monica Gets A Roommate
## 4 emily, married, love, london, pheebs Monica Gets A Roommate
## 5 guys, thanksgiving, year, dog, school Monica Gets A Roommate
## 6 monkey, marcel, people, joke, drake Monica Gets A Roommate
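A long table of this shape can be derived from theta by unpivoting it, as in the base R sketch below. The theta values, the top-terms lookup, and the output file name are invented for illustration; the column names follow the table shown here.

```r
# Invented theta: episodes in rows, topics in columns.
theta <- matrix(c(0.7, 0.3,
                  0.4, 0.6),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("1 : 01", "1 : 02"), c("t_1", "t_2")))

# Invented lookup of top terms per topic.
top_terms <- c(t_1 = "sister, meet", t_2 = "wedding, married")

# Unpivot: one row per (episode, topic) pair.
long <- as.data.frame(as.table(theta), stringsAsFactors = FALSE)
names(long) <- c("id", "topic", "value")

long$topic_num <- as.integer(sub("^t_", "", long$topic))
long$top_terms <- unname(top_terms[long$topic])

# write.csv(long, "theta_long.csv", row.names = FALSE)  # hypothetical file name
long
```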
We built an interactive tool where the user can search for the episodes that most relate to each one of the 15 topics. To access the tool click here
To look for the main features that drive the rating up or down, it is useful to start by observing its temporal structure. The following graph is interactive:
Click on the season title to (de)activate each series
Next, to get a cleaner view of the data, we create a boxplot.
The boxplots not only confirm, in a cleaner graph, the behavior observed previously, but also provide additional insights like:
A likely explanation for the behavior in the last 7 episodes of all seasons is that the writers may have prepared some intricate (and not necessarily amusing) plot to reach its climax in the last episodes.
We included four elements that are interactive: